Our team sought to answer an intriguing question: can we predict the political party of Twitter users from the words they tweet? After some discussion, we narrowed this question to inference on current members of Congress. To this end, we used the Twitter API to gather the last year’s tweets from all Senate and House members. Taking a random sample of tweets, we distilled this large body of text into the density of the words used by each user, expressed as a ratio relative to the user who used each word most often.
With this data we applied unsupervised learning techniques like principal component analysis (PCA) and clustering, as well as supervised techniques like logistic regression and the random forest. Our aim in applying these methods was inference rather than prediction. By building models with better predictive capability, we can gain sharper insights into the structure of the data and the language associated with political party.
Note: in general, the data transforms used take a while to run, so we pre-load the transformed data and only run the necessary code.
All files referenced in this section are in the DataCollection folder. Our sources for data collection are the two files representatives.txt and senators.txt, taken from the GWU Libraries Dataverse. These files contain the last 3,200 tweets from every member of the 115th Congress (the current session), excepting four members of the House who don’t have official Twitter accounts: Collin Peterson (D-MN-07), Lacy Clay (D-MO-01), Madeleine Bordallo (Guam delegate), and Gregorio Sablan (Northern Mariana Islands delegate). Each of these files is a list of tweet IDs, which uniquely identify tweet objects in the Twitter API. Metadata about how user accounts were identified is stored in the corresponding README files. Using the script get_twitter_data.py, we pulled down a random sample of 10,001 tweets from the House of Representatives (10001_house.zip) and 50,000 tweets from the Senate (50000_senate.zip).
Our second data set is legislators-current.csv, which contains (among other variables) the following information on all current members of Congress: name, state, chamber (House or Senate), district (if House), party, website, and social media account names. We use this data set to identify the political party of each Twitter account in the data set. Because this file comes from a different source than our Twitter data and some politicians use multiple Twitter accounts (for example, @POTUS versus @realDonaldTrump), some manual cleaning was needed to make sure all accounts in the Twitter data set are present in the Congress data set. In the script add_congress_data.R, we “fill in” this information; most of the fixes turned out to be replacements that differed only in capitalization.
Now that we have two data sets that completely match on Twitter username, we can transform the data into the form we want. The json_to_df.R script takes in the tweets as JSON files, extracts the information we’re interested in from each tweet, and creates a dataframe out of this. Each row of this dataframe is a tweet, and the columns are variables like tweet id, timestamp, text, and author. The tidy_text.R script parses out the content of the tweets and counts the occurrences of each word by user, scales each row and column, then joins this with the congress_df dataset to make full_data.RData. Each row of this dataset is a user, each column is a different word used, and the entries are scaled proportions of how often a user used each word. For ease of computation, only words used by at least 10 distinct users were considered.
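The scaling step can be illustrated on a toy example. The sketch below is not the contents of tidy_text.R; it assumes a hypothetical table `tweets_words` with one row per word occurrence, converts counts to per-user proportions, and then divides each word's column by its maximum so that the heaviest user of each word scores 1. This is one reading of the scaling described above, not a confirmed reproduction of it.

```r
# Hypothetical toy data: one row per word occurrence (not the real tweets).
tweets_words <- data.frame(
  user = c("a", "a", "a", "b", "b", "c", "c", "c", "c"),
  word = c("tax", "tax", "care", "care", "vote", "tax", "care", "care", "vote"),
  stringsAsFactors = FALSE
)

# Count occurrences of each word by user (users in rows, words in columns).
counts <- unclass(table(tweets_words$user, tweets_words$word))

# Row scaling: share of each user's words accounted for by each word.
props <- counts / rowSums(counts)

# Column scaling: express each share relative to the user who uses that
# word most, so every column has a maximum of 1.
scaled <- sweep(props, 2, apply(props, 2, max), "/")

print(round(scaled, 3))
```

In this toy matrix, user "a" scores 1 on "tax" because no other user devotes a larger share of their words to it.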
In the file make_plots.R, we plot some basic results of the data.
The top plot shows how often members of each party use each word on a log scale. For example, Republicans use the word “senate” about 0.6% of the time, whereas Democrats use it about 0.4% of the time. The red line represents equal usage between Democrats and Republicans. The bottom plot shows the log odds ratio log(Democrat usage/Republican usage) for the 15 words used most by each party compared to the other. Not all words can be shown in the first plot, so let’s break this up into a few categories.
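As a small worked example of the log ratio, the sketch below uses made-up usage proportions. The `usage` data frame and the two placeholder words are illustrative only; the “senate” row uses the approximate 0.4% and 0.6% figures quoted above. Positive values indicate words favored by Democrats, negative values words favored by Republicans.

```r
# Illustrative usage proportions; only the "senate" row echoes the text.
usage <- data.frame(
  word      = c("senate", "example_word_1", "example_word_2"),
  dem_usage = c(0.004, 0.0020, 0.0005),
  rep_usage = c(0.006, 0.0005, 0.0020),
  stringsAsFactors = FALSE
)

# log(Democrat usage / Republican usage), as defined in the text.
usage$log_odds <- log(usage$dem_usage / usage$rep_usage)

# Sort from most Democrat-leaning to most Republican-leaning.
usage <- usage[order(usage$log_odds, decreasing = TRUE), ]
print(usage)
```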
While some of these make intuitive sense (more Democrats tag other Democrats, and vice versa), one interesting note is that Democrats tag both @housegop and @senatedems more, and Republicans are more likely to tag @foxnews, @foxbusiness, and @aipac (the American Israel Public Affairs Committee, a pro-Israel lobbying group).
In the use of hashtags, we see some opposites between the two parties: #obamacare vs. #trumpcare, #passthebill vs. #killthebill (in regard to the tax reform bill), #marchforlife and #pro_life vs. #istandwithpp. Some other perceived talking points of the two parties emerge: the Iran nuclear deal and the Keystone XL pipeline for Republicans, and climate change and the Trump-Russia investigation for Democrats.
The “regular words” (not hashtags or tagged users) used have a few more potentially uninteresting words (such as “morning”), but we can still see a few things:
- Republicans tweet a lot about Obama, while Democrats tweet a lot about Trump.
- Republicans prefer the word “obamacare”, while Democrats prefer “aca”.
It turns out that visualizing a data set with 4345 variables is tricky, to say the least. To get around this, we applied PCA to see what actually made a difference in the data.
d <- full_data[,-(1:2)]
pca1 <- prcomp(d)
pc_df <- data.frame(PC = 1:20,
PVE = pca1$sdev[1:20]^2 / sum(pca1$sdev[1:20]^2))
ggplot(pc_df, aes(x = PC, y = PVE)) +
geom_line() +
geom_point()
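From the scree plot, the first 3 PCs account for most of the variance captured by the first 20 components. Note that the PVE plotted here is normalized by the variance of those 20 components only, so it overstates each component’s share of the total variance.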
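To make the normalization concrete, the sketch below uses R’s built-in USArrests data as a stand-in (not the tweet data) and computes the proportion of variance explained two ways: relative to the total variance across all components, which is the usual definition, and relative to only the first `k` components, as the `pc_df` code above does with 20. The names `pca_demo` and `k` are illustrative.

```r
# Stand-in data set shipped with base R; four numeric columns.
pca_demo <- prcomp(USArrests, scale. = TRUE)

# Variance of each principal component.
var_all <- pca_demo$sdev^2

# PVE relative to the total variance (sums to 1 over all components).
pve_total <- var_all / sum(var_all)

# PVE relative to the first k components only (sums to 1 over those k).
k <- 2
pve_first_k <- var_all[1:k] / sum(var_all[1:k])

print(round(pve_total, 3))
print(round(pve_first_k, 3))
```

The second normalization can never report a smaller share than the first, because its denominator leaves out the variance of the remaining components.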
scores_df <- data.frame(user = full_data$twitter,
party = full_data$party_id,
PC1 = pca1$x[,1],
PC2 = pca1$x[,2],
PC3 = pca1$x[,3],
PC4 = pca1$x[,4]) %>%
left_join(congress_df, by = c("user" = "twitter"))
loading_df <- data.frame(word = colnames(d),pca1$rotation[ ,1:4])
ggplot(scores_df, aes(x=PC1, y = party)) + geom_jitter()
The first principal component does a pretty good job of encoding what party the user belongs to, Democrat (-) or Republican (+).
ggplot(scores_df, aes(x=PC2, y = chamber_type)) + geom_jitter()
The second principal component appears to distinguish Representatives (+) from Senators (-).
kable(arrange(loading_df, desc(PC3))[c(1:10,4336:4345), ])
| | word | PC1 | PC2 | PC3 | PC4 |
|---|---|---|---|---|---|
| 1 | families | -0.1239068 | 0.0117368 | 0.2008988 | 0.0796565 |
| 2 | #trumpcare | -0.3533959 | 0.0060057 | 0.1374581 | 0.0205737 |
| 3 | #paymoreforless | -0.1578449 | 0.0385303 | 0.1311770 | 0.0086349 |
| 4 | seniors | -0.0913230 | 0.0124293 | 0.1084742 | 0.0452491 |
| 5 | #veteransday | 0.0270323 | -0.0755258 | 0.0879955 | -0.1819736 |
| 6 | hurt | -0.0749704 | 0.0211459 | 0.0856073 | 0.0204101 |
| 7 | coverage | -0.1592510 | 0.0102826 | 0.0851376 | 0.0453163 |
| 8 | advances | 0.0275588 | -0.0726092 | 0.0801361 | -0.1647117 |
| 9 | tie | 0.0242893 | -0.0590068 | 0.0733977 | -0.1526220 |
| 10 | #trumpcares | -0.0537211 | 0.0136856 | 0.0710925 | 0.0265834 |
| 4336 | #trumprussia | -0.0376904 | -0.0177762 | -0.0960597 | -0.0442235 |
| 4337 | investigate | -0.0281191 | -0.0817810 | -0.0988678 | 0.0606355 |
| 4338 | credible | -0.0300072 | -0.0098416 | -0.1010107 | -0.0293986 |
| 4339 | chairman | -0.0100436 | -0.0009404 | -0.1021721 | -0.0521272 |
| 4340 | conduct | -0.0272213 | -0.0198882 | -0.1037505 | -0.0148771 |
| 4341 | investigation | -0.0397529 | -0.0429024 | -0.1378704 | -0.0358674 |
| 4342 | independent | -0.1316147 | -0.0430895 | -0.1391207 | -0.0441689 |
| 4343 | committee | -0.0335960 | -0.0242492 | -0.1434073 | -0.0752751 |
| 4344 | russia | -0.0367527 | -0.0614352 | -0.1638510 | -0.0174157 |
| 4345 | nunes | -0.0746828 | -0.0205165 | -0.1785077 | -0.0933248 |
The third principal component weights users differently based on whether they talk more about health care (+) or the Russia investigation (-).
We expected the first one to be party, and spent a while trying to figure out what the second PC could be (but it makes sense that chamber shows up). The 3rd one, however, was the most surprising.
Below we’ve plotted a summary of the first two components, along with the non-text variables we think they best encode.
ggplot(scores_df,aes(x = PC1, y = PC2, color = party, shape = chamber_type)) +
geom_point() +
scale_color_manual(values=c("#619CFF","#00BA38","#F8766D")) +
scale_shape_manual(values=c(1,16))
km1 <- kmeans(d, centers = 1)
km2 <- kmeans(d, centers = 2, iter.max = 10, nstart = 20)
km3 <- kmeans(d, centers = 3, iter.max = 10, nstart = 20)
km4 <- kmeans(d, centers = 4, iter.max = 10, nstart = 20)
km5 <- kmeans(d, centers = 5)
km6 <- kmeans(d, centers = 6)
km7 <- kmeans(d, centers = 7)
bub <- data.frame(ClusterNumber = 1:7,
tot.within.ss = c(km1$tot.withinss,
km2$tot.withinss,
km3$tot.withinss,
km4$tot.withinss,
km5$tot.withinss,
km6$tot.withinss,
km7$tot.withinss
))
ggplot(bub, aes(x = ClusterNumber, y = tot.within.ss)) +
geom_line() +
geom_point()
This is a scree plot for the number of clusters applied to the data. No clear elbow exists in the plot, implying that there is no strong clustering of the data. Below we plot some of these clusters on the first two principal components.
cluster_df <-data.frame(party = scores_df$party,
chamber = scores_df$chamber_type,
PC1 = pca1$x[,1],
PC2 = pca1$x[,2],
k2 = km2$cluster, k3=km3$cluster, k4=km4$cluster)
ggplot(cluster_df, aes(x = PC1, y = PC2, color = as.factor(k2),shape=party)) +
geom_point() +
scale_shape_manual(values=c(1,17,16))
ggplot(cluster_df, aes(x = PC1, y = PC2, color = as.factor(k3))) + geom_point()
ggplot(cluster_df, aes(x = PC1, y = PC2, color = as.factor(k4))) + geom_point()
In this analysis, 2 clusters separate the parties, 3 clusters group the entire Senate together and split the House by party, and 4 clusters add a less stable 4th group: depending on the random start, it sometimes splits the Senate by party and is sometimes sprinkled throughout the other groups. We can see how well the 2-clustering assigns to party:
conf <- table(cluster_df$k2, cluster_df$party)
kable(conf)
| | Democrat | Independent | Republican |
|---|---|---|---|
| 1 | 57 | 1 | 269 |
| 2 | 178 | 1 | 0 |
If we consider it to be a “classification model”, the 2-cluster solution has a misclassification rate (MCR) of 0.1146245. Overall, the clustering seems to agree with our PCA in that the most identifiable feature is party, followed by chamber.
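The reported rate can be reproduced from the confusion table under one set of assumptions, which the source does not confirm: the first cluster is read as “Republican”, the second as “Democrat”, and the two Independents are counted with the Democrats. The object names below are illustrative.

```r
# Confusion table from the 2-cluster solution (values copied from above).
conf_demo <- matrix(
  c(57, 178,   # Democrat
    1,  1,     # Independent
    269, 0),   # Republican
  nrow = 2,
  dimnames = list(cluster = c("1", "2"),
                  party = c("Democrat", "Independent", "Republican"))
)

# Assumption: cluster 1 = Republican, cluster 2 = Democrat,
# with Independents grouped with the Democrats.
errors <- conf_demo["1", "Democrat"] +
  conf_demo["1", "Independent"] +
  conf_demo["2", "Republican"]

mcr_demo <- errors / sum(conf_demo)
print(mcr_demo)
```

Counting the Independent in the second cluster as an error as well would give 59 misclassified users instead of 58.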
naive_mcr <- mean(scores_df$party != "Republican")
kable(scores_df %>% group_by(party) %>%
summarize(n = n(),
prop = n/506))
| party | n | prop |
|---|---|---|
| Democrat | 235 | 0.4644269 |
| Independent | 2 | 0.0039526 |
| Republican | 269 | 0.5316206 |
Our most naive model is just the mode, that every politician in the data set is a Republican. This gives us a misclassification rate of 0.4683794. Any model that can improve on this (not a hard task) will give us more insight into the data.
Because (with 2 exceptions) we’re seeking to classify into two parties, logistic regression makes sense. To fit the model, we remove the two independent senators (Bernie Sanders, VT; and Angus King, ME) from our data set. Because of the exceedingly large number of predictors, a regularized model using the lasso or ridge penalty is appealing. We use 5-fold cross-validation to choose the penalty parameter \(\lambda\) and to guard against overfitting.
no_ind <- filter(full_data, party_id != "Independent") %>%
mutate(party_id = factor(as.character(party_id)))
logit_ridge <- glmnet(data.matrix(no_ind[ ,-(1:2)]), no_ind$party_id, family = "binomial", alpha=0)
ridge_grid <- exp(seq(0,5,length.out=50))
ridge_cv <- cv.glmnet(data.matrix(no_ind[ ,-(1:2)]), no_ind$party_id, family = "binomial", alpha=0, nfolds=5, type.measure = "class", lambda = ridge_grid)
ridge_bestlam <- ridge_cv$lambda.min
ridge_pred <- predict(logit_ridge, s = ridge_bestlam, newx=data.matrix(full_data[ ,-(1:2)]), type = "class")
ridge_mcr <- mean(ridge_pred != full_data$party_id)
logit_lasso <- glmnet(data.matrix(no_ind[ ,-(1:2)]), no_ind$party_id, family = "binomial", alpha=1)
lasso_grid <- exp(seq(-6,-2,length.out=50))
lasso_cv <- cv.glmnet(data.matrix(no_ind[ ,-(1:2)]), no_ind$party_id, family = "binomial", alpha=1, nfolds=5, type.measure = "class", lambda = lasso_grid)
lasso_bestlam <- lasso_cv$lambda.min
lasso_pred <- predict(logit_lasso, s = lasso_bestlam, newx=data.matrix(full_data[ ,-(1:2)]), type = "class")
lasso_mcr <- mean(lasso_pred != full_data$party_id)
plot(ridge_cv)
plot(lasso_cv)
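The cross-validation used above to pick \(\lambda\) can be sketched by hand. `cv.glmnet` handles fold assignment internally; the example below instead uses simulated data and an unpenalized `glm()` as a stand-in for the penalized fit, purely to show how a 5-fold misclassification estimate is assembled. All names (`sim`, `fold`, `fold_mcr`) are illustrative.

```r
set.seed(1)

# Simulated binary outcome with two predictors (not the tweet data).
n <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
y <- rbinom(n, 1, plogis(2 * x1 - x2))
sim <- data.frame(y = y, x1 = x1, x2 = x2)

# Randomly assign each observation to one of k equally sized folds.
k <- 5
fold <- sample(rep(1:k, length.out = n))

# For each fold: fit on the other k - 1 folds, predict the held-out fold,
# and record the held-out misclassification rate.
fold_mcr <- sapply(1:k, function(i) {
  train <- sim[fold != i, ]
  test <- sim[fold == i, ]
  fit <- glm(y ~ x1 + x2, data = train, family = binomial)
  pred <- as.integer(predict(fit, newdata = test, type = "response") > 0.5)
  mean(pred != test$y)
})

# The cross-validated estimate is the average over folds.
cv_mcr <- mean(fold_mcr)
print(cv_mcr)
```

With a penalized model, this loop would be repeated for every candidate \(\lambda\), and the value with the lowest average error would be selected.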
Because both models perform better than \(\lambda=0\) (regular logistic regression), we can feel confident in choosing one of these over the full logistic model. For the dataset, our ridge MCR is 0.0375494, and our lasso MCR is 0.0039526. The lasso being better makes some intuitive sense, as we would expect some words to be meaningless for prediction. We can examine which words were non-zero in the lasso model:
kable(data.frame(word = colnames(full_data)[-(1:2)],
coeff = as.vector(predict(logit_lasso, s = lasso_bestlam, type = "coefficients"))[-1]) %>%
filter(coeff !=0) %>%
arrange(desc(coeff)))
| word | coeff |
|---|---|
| forward | 0.2333976 |
| obama | 0.1789967 |
| #obamacare | 0.0812549 |
| energy | 0.0285472 |
| savings | -0.0052260 |
| sj | -0.0113303 |
| corporations | -0.0149638 |
| deserve | -0.0156128 |
| bill | -0.0302886 |
| massive | -0.0551805 |
| bipartisan | -0.0577781 |
| critical | -0.0673239 |
| hate | -0.1971708 |
| ties | -0.1992535 |
| tear | -0.2071270 |
| predatory | -0.2208062 |
| environment | -0.2324100 |
| recreation | -0.2368057 |
| stem | -0.2720241 |
| demands | -0.2759973 |
| protect | -0.2779761 |
| | -0.3401555 |
| transparent | -0.3455557 |
| background | -0.3685572 |
| civil | -0.3800988 |
| #usa | -0.3821922 |
| deal | -0.4151496 |
| prioritize | -0.4274343 |
| @sencortezmasto | -0.4464951 |
| discrimination | -0.4782126 |
| sad | -0.4982090 |
| farming | -0.4989988 |
| @timkaine | -0.5070158 |
| default | -0.5304747 |
| million | -0.5305152 |
| recuse | -0.5458326 |
| afford | -0.6293077 |
| #paymoreforless | -0.6443399 |
| @housegop | -0.6542512 |
| medicare | -0.6599443 |
| people | -0.6661578 |
| @gop | -0.6766630 |
| bannon | -0.6868695 |
| tomorrows | -0.7152784 |
| backwards | -0.7180170 |
| services | -0.7305055 |
| constitution | -0.7397737 |
| trumpcare | -0.7580202 |
| @senrobportman | -0.7804709 |
| nunes | -0.8140484 |
| aca | -0.8315549 |
| acres | -0.8683048 |
| sens | -0.8809917 |
| environmental | -0.8810839 |
| shutdown | -0.8825484 |
| resignation | -0.9004789 |
| pulling | -0.9408735 |
| average | -0.9684261 |
| scott | -1.0238853 |
| @senfranken | -1.0275131 |
| @epa | -1.0299424 |
| fortune | -1.0425528 |
| coverage | -1.0486840 |
| robotics | -1.1021268 |
| foreign | -1.1889158 |
| voices | -1.2233117 |
| gop | -1.3639976 |
| homes | -1.3914316 |
| #actonclimate | -1.5335188 |
| interference | -1.5472966 |
| extreme | -1.5521337 |
| dont | -1.6551365 |
| tuned | -1.6667338 |
| task | -1.6878985 |
| transgender | -1.7027096 |
| independent | -1.7118178 |
| blame | -1.7548780 |
| americans | -1.7549179 |
| voting | -1.8344724 |
| internet | -1.8748075 |
| pruitts | -1.9214229 |
| hour | -1.9845166 |
| #equalpayday | -1.9969170 |
| fargo | -2.0660914 |
| trumps | -2.2601759 |
| seniors | -2.2677727 |
| package | -2.3519416 |
| partisan | -2.4587795 |
| cut | -2.7464459 |
| dem | -2.9027135 |
| gops | -3.0319127 |
| #aca | -3.0869864 |
| oppose | -3.2637105 |
| overdose | -3.2804956 |
| pay | -3.3137125 |
| base | -3.5131112 |
| trump | -4.1334327 |
| unacceptable | -4.1415088 |
| #broadbandprivacy | -4.5702820 |
| #trumpcare | -6.8340150 |
In this list we can see some of the same topics that we found through PCA, like health care and the Russia investigation, as well as net neutrality and the EPA/climate change. Judging by the magnitude of the coefficients on each side of 0, more of the words that were important in deciding a user’s party were words typically used by Democrats (as determined by our exploratory data analysis).
We can also check which users were misclassified by the lasso and ridge models.
lasso_missed <- data.frame(twitter = full_data$twitter,
state = scores_df$state,
party = full_data$party_id,
pred = lasso_pred,
prob = as.vector(predict(logit_lasso, s = lasso_bestlam, newx=data.matrix(full_data[ ,-(1:2)]), type = "response")),
stringsAsFactors = FALSE) %>%
filter(party != X1) %>%
arrange(desc(prob))
kable(lasso_missed)
| twitter | state | party | X1 | prob |
|---|---|---|---|---|
| SenAngusKing | ME | Independent | Republican | 0.8576747 |
| SenSanders | VT | Independent | Democrat | 0.1172869 |
ridge_missed <- data.frame(twitter = full_data$twitter,
state = scores_df$state,
party = full_data$party_id,
pred = ridge_pred,
prob = as.vector(predict(logit_ridge, s = ridge_bestlam, newx=data.matrix(full_data[ ,-(1:2)]), type = "response")),
stringsAsFactors = FALSE) %>%
filter(party != X1) %>%
arrange(desc(prob))
kable(ridge_missed)
| twitter | state | party | X1 | prob |
|---|---|---|---|---|
| repdavidscott | GA | Democrat | Republican | 0.5826187 |
| SenAngusKing | ME | Independent | Republican | 0.5779478 |
| RepGonzalez | TX | Democrat | Republican | 0.5712017 |
| RepAlGreen | TX | Democrat | Republican | 0.5631460 |
| RepSinema | AZ | Democrat | Republican | 0.5514052 |
| AnthonyBrownMD4 | MD | Democrat | Republican | 0.5382995 |
| RepBetoORourke | TX | Democrat | Republican | 0.5323220 |
| RepJimCosta | CA | Democrat | Republican | 0.5312471 |
| RepDerekKilmer | WA | Democrat | Republican | 0.5282319 |
| Sen_JoeManchin | WV | Democrat | Republican | 0.5174060 |
| SenatorTester | MT | Democrat | Republican | 0.5140082 |
| RepOHalleran | AZ | Democrat | Republican | 0.5125567 |
| SenDonnelly | IN | Democrat | Republican | 0.5113260 |
| RepStephMurphy | FL | Democrat | Republican | 0.5090062 |
| MarkWarner | VA | Democrat | Republican | 0.5085220 |
| RepJoshG | NJ | Democrat | Republican | 0.5067471 |
| SenatorHeitkamp | ND | Democrat | Republican | 0.5062373 |
| RepTomSuozzi | NY | Democrat | Republican | 0.5004556 |
| SenSanders | VT | Independent | Democrat | 0.2426629 |
Independents Angus King and Bernie Sanders both caucus with the Democrats, so we can consider Senator Sanders’ classification correct. Of note is that both of our logistic models only misclassified Democrats as Republicans! In addition, many of these congresspeople are Democratic legislators from majority Republican states like West Virginia, Texas, and Georgia.
Plotting these missed users among all the points, we see that most of them are Democrats grouped in the Republican cloud to the right. This seems to agree with our earlier statement that PC1 encodes party.
binary <- mutate(no_ind, party_id = ifelse(party_id == "Democrat", 0, 1))
boost_tweet <- gbm(party_id ~ .-twitter, data = binary,
n.trees = 1000,
shrinkage = 0.03)
## Distribution not specified, assuming bernoulli ...
boost_pred <- predict(boost_tweet,
newdata = full_data,
n.trees = 1000,
type = "response") > .5
boost_pred <- replace(boost_pred, boost_pred==TRUE, "Republican")
boost_pred <- replace(boost_pred, boost_pred==FALSE, "Democrat")
boost_mcr <- mean(boost_pred != full_data$party_id)
kable(head(summary(boost_tweet),20))
| | var | rel.inf |
|---|---|---|
| #trumpcare | #trumpcare | 44.9478384 |
| #aca | #aca | 5.3611618 |
| coverage | coverage | 3.3904559 |
| million | million | 2.2027373 |
| trump | trump | 1.9996101 |
| obamacare | obamacare | 1.8571873 |
| protect | protect | 1.8022129 |
| people | people | 1.7743480 |
| seniors | seniors | 1.5678119 |
| aca | aca | 1.5350070 |
| oppose | oppose | 1.5235915 |
| bill | bill | 1.4307878 |
| unacceptable | unacceptable | 1.3958775 |
| voting | voting | 1.3830423 |
| gop | gop | 1.2531726 |
| aisle | aisle | 1.0732624 |
| introduced | introduced | 0.9968308 |
| ties | ties | 0.9689715 |
| #broadbandprivacy | #broadbandprivacy | 0.9048346 |
| dont | dont | 0.7809643 |
Our boosted tree’s MCR is 0.013834. In the boosted tree, we really see #trumpcare stand out in variable importance, as well as themes like health care and net neutrality.
boost_missed <- data.frame(twitter = full_data$twitter,
state = scores_df$state,
party = full_data$party_id,
pred = boost_pred,
prob = as.vector(predict(boost_tweet,
newdata = full_data,
n.trees = 1000,
type = "response")),
stringsAsFactors = FALSE) %>%
filter(party != pred) %>%
arrange(desc(prob))
kable(boost_missed)
| twitter | state | party | pred | prob |
|---|---|---|---|---|
| SenAngusKing | ME | Independent | Republican | 0.9208607 |
| RepGonzalez | TX | Democrat | Republican | 0.7459837 |
| AnthonyBrownMD4 | MD | Democrat | Republican | 0.6341021 |
| RepAlGreen | TX | Democrat | Republican | 0.5816371 |
| repdavidscott | GA | Democrat | Republican | 0.5426301 |
| RepJimCosta | CA | Democrat | Republican | 0.5236780 |
| SenSanders | VT | Independent | Democrat | 0.2046912 |
We see that many of the same congresspeople get misclassified in the boosted tree as in the logistic models.
We attempted regular bagging as well, but that was computationally infeasible.
tweet_rf <- randomForest(data.matrix(no_ind[ ,-(1:2)]), no_ind$party_id, importance = TRUE)
rf_pred <- predict(tweet_rf, newdata = data.matrix(full_data[ ,-(1:2)]), type = "response")
rf_mcr <- mean(rf_pred != as.character(full_data$party_id))
Our random forest misclassification rate is 0.0059289. This is slightly higher than the lasso, but still a good bit better than the ridge model.
importance <- data.frame(tweet_rf$importance) %>%
rownames_to_column() %>%
arrange(desc(MeanDecreaseAccuracy))
kable(head(importance, 20))
| rowname | Democrat | Republican | MeanDecreaseAccuracy | MeanDecreaseGini |
|---|---|---|---|---|
| #trumpcare | 0.0127390 | 0.0468019 | 0.0306329 | 9.713332 |
| #paymoreforless | 0.0054547 | 0.0248032 | 0.0157184 | 5.129080 |
| #aca | 0.0005295 | 0.0242226 | 0.0130932 | 4.864742 |
| #broadbandprivacy | 0.0013899 | 0.0175384 | 0.0099967 | 3.892037 |
| oppose | 0.0025342 | 0.0143878 | 0.0088295 | 3.107782 |
| #protectourcare | 0.0007879 | 0.0156880 | 0.0087919 | 3.221491 |
| coverage | 0.0032383 | 0.0132550 | 0.0085739 | 4.243509 |
| independent | 0.0010066 | 0.0118849 | 0.0067837 | 2.808360 |
| lose | 0.0013026 | 0.0110005 | 0.0064029 | 3.225091 |
| voting | 0.0018554 | 0.0103140 | 0.0063605 | 4.000431 |
| million | 0.0007622 | 0.0097789 | 0.0056329 | 3.044651 |
| cuts | 0.0004832 | 0.0100751 | 0.0056108 | 2.070664 |
| americans | 0.0029419 | 0.0074259 | 0.0053520 | 3.337898 |
| republicans | 0.0009553 | 0.0086694 | 0.0050252 | 2.743931 |
| aca | 0.0012165 | 0.0077063 | 0.0047266 | 2.035161 |
| wealthy | 0.0000836 | 0.0083623 | 0.0044675 | 1.950740 |
| @housegop | 0.0003081 | 0.0077733 | 0.0042920 | 2.235260 |
| seniors | 0.0013829 | 0.0067745 | 0.0042918 | 2.214019 |
| commission | 0.0001714 | 0.0074168 | 0.0039913 | 1.874837 |
| gop | 0.0006458 | 0.0067122 | 0.0039435 | 2.574212 |
Many of the same words from earlier appear to have high variable importance in the random forest we fit. The most important words here also correspond to words that are most often used by Democrats, which is interesting.
rf_missed <- data.frame(twitter = full_data$twitter,
state = scores_df$state,
party = full_data$party_id,
pred = rf_pred,
prob = predict(tweet_rf,
newdata = data.matrix(full_data[ ,-(1:2)]),
type = "prob")[ ,2],
stringsAsFactors = FALSE) %>%
filter(party != as.character(pred)) %>%
arrange(desc(prob))
kable(rf_missed)
| twitter | state | party | pred | prob |
|---|---|---|---|---|
| SenAngusKing | ME | Independent | Republican | 0.750 |
| RepGonzalez | TX | Democrat | Republican | 0.564 |
| SenSanders | VT | Independent | Democrat | 0.140 |
In addition to the two Independents, the forest misclassified Rep. Vicente González of Texas’ 15th Congressional District.
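To recap, our models’ misclassification rates were:
| Model | MCR |
|---|---|
| Naive (all Republican) | 0.4683794 |
| 2-cluster k-means | 0.1146245 |
| Ridge logistic regression | 0.0375494 |
| Lasso logistic regression | 0.0039526 |
| Boosted tree | 0.013834 |
| Random forest | 0.0059289 |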
Our first 3 PCs encoded party, chamber, and whether a user talked more about health care or the Russia investigation, respectively. Our clustering grouped first by party, then by chamber.
We can think about which members of Congress our non-naive models found it harder to classify.
missed <- data.frame(missed = unique(c(ridge_missed$twitter, lasso_missed$twitter, boost_missed$twitter, rf_missed$twitter))) %>%
left_join(congress_df, by = c("missed" = "twitter"))
kable(missed)
| missed | last_name | first_name | chamber_type | state | party_id |
|---|---|---|---|---|---|
| repdavidscott | Scott | David | rep | GA | Democrat |
| SenAngusKing | King | Angus | sen | ME | Independent |
| RepGonzalez | Gonzalez | Vicente | rep | TX | Democrat |
| RepAlGreen | Green | Al | rep | TX | Democrat |
| RepSinema | Sinema | Kyrsten | rep | AZ | Democrat |
| AnthonyBrownMD4 | Brown | Anthony | rep | MD | Democrat |
| RepBetoORourke | O’Rourke | Beto | rep | TX | Democrat |
| RepJimCosta | Costa | Jim | rep | CA | Democrat |
| RepDerekKilmer | Kilmer | Derek | rep | WA | Democrat |
| Sen_JoeManchin | Manchin | Joe | sen | WV | Democrat |
| SenatorTester | Tester | Jon | sen | MT | Democrat |
| RepOHalleran | O’Halleran | Tom | rep | AZ | Democrat |
| SenDonnelly | Donnelly | Joe | sen | IN | Democrat |
| RepStephMurphy | Murphy | Stephanie | rep | FL | Democrat |
| MarkWarner | Warner | Mark | sen | VA | Democrat |
| RepJoshG | Gottheimer | Josh | rep | NJ | Democrat |
| SenatorHeitkamp | Heitkamp | Heidi | sen | ND | Democrat |
| RepTomSuozzi | Suozzi | Thomas | rep | NY | Democrat |
| SenSanders | Sanders | Bernard | sen | VT | Independent |
Again, we ignore Senator Sanders’ misclassification because he is considered farther to the left than the rest of the Democratic Party and caucuses with the Democrats. Many of the people misclassified are members of the Blue Dog Coalition, a House caucus of “fiscally-responsible Democrats” who are traditionally more conservative than the party in general.
blue_dogs <- c("Costa", "Cuellar", "Lipinski", "Bishop", "Cooper", "Correa", "Crist", "Gonzalez", "Gottheimer", "Murphy", "O’Halleran", "Peterson", "Schneider", "Schrader", "Scott", "Sinema", "Thompson", "Vela")
kable(filter(missed, !(last_name %in% c(blue_dogs, "Sanders"))))
| missed | last_name | first_name | chamber_type | state | party_id |
|---|---|---|---|---|---|
| SenAngusKing | King | Angus | sen | ME | Independent |
| RepAlGreen | Green | Al | rep | TX | Democrat |
| AnthonyBrownMD4 | Brown | Anthony | rep | MD | Democrat |
| RepBetoORourke | O’Rourke | Beto | rep | TX | Democrat |
| RepDerekKilmer | Kilmer | Derek | rep | WA | Democrat |
| Sen_JoeManchin | Manchin | Joe | sen | WV | Democrat |
| SenatorTester | Tester | Jon | sen | MT | Democrat |
| SenDonnelly | Donnelly | Joe | sen | IN | Democrat |
| MarkWarner | Warner | Mark | sen | VA | Democrat |
| SenatorHeitkamp | Heitkamp | Heidi | sen | ND | Democrat |
| RepTomSuozzi | Suozzi | Thomas | rep | NY | Democrat |
Of the remaining members, many come from rural, southern, or typically Republican states, and our models may have picked up on some of the topics they tweet about that line up more with Republicans’ tweets. One thing of note is that every model misclassified Maine Senator Angus King as a Republican, even though he is an Independent who caucuses with the Democrats. Maine has a history of strong independent parties, and King is a former Democrat who left the party before running for governor (against Susan Collins, the other current senator from Maine). Upon leaving the party, King stated that “The Democratic Party as an institution has become too much the party that is looking for something from government,” indicating that he has some views sympathetic with Republicans (or at least dissimilar to Democrats).
In terms of variable importance, our models noted many of the same words that we saw in our exploratory data analysis. #trumpcare was almost always the most important variable, and important topics included health care, the Russia investigation, and net neutrality. Most of the words considered “important” were words used more often by Democrats, which is interesting. One reason for this might be that Democrats occupied a wider range of scores on PC1, indicating that their tweets were more dissimilar and thus harder to classify.
While our research question focused only on inference, we originally set out to build a predictive model. Because politicians’ official Twitter accounts use such different words than “regular people”, however, we would have had to either
1. create a database of regular Twitter users whose political affiliation we knew, or
2. only build a predictive model for politicians.
We found logistical and ethical issues with the first option, and didn’t see the point in building the second model, as almost all politicians list their party openly on their Twitter account. The second approach could, however, be used to predict how a “non-partisan” elected official would act in practice, though different data might be required to fit that specific need.
Because we were only interested in inference, we didn’t see as much of a need to worry about overfitting or model validation as if we were building predictive models. To verify the inferences we made, another data set could be created and used to test our models. However, this test data set, although it would certainly contain different tweets from our training data, would have tweets from the same people as our training data. The two data sets would not be independent in this way. Because our data set for fitting models had a relatively low ratio of observations to predictors (< 1/4), we decided to use all the data we had collected rather than just a random sample of users.
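If a held-out check were wanted despite these caveats, one option would be to set aside a random subset of users, stratified by party, before fitting. The sketch below shows only the splitting step, on a made-up `users` table; the 20% test fraction and all object names are assumptions for illustration and are not part of the original analysis.

```r
set.seed(42)

# Made-up user table standing in for the real account list.
users <- data.frame(
  twitter = paste0("user", 1:20),
  party_id = rep(c("Democrat", "Republican"), each = 10),
  stringsAsFactors = FALSE
)

# Within each party, randomly pick 20% of the row indices for the test set.
test_idx <- unlist(lapply(
  split(seq_len(nrow(users)), users$party_id),
  function(idx) sample(idx, size = round(0.2 * length(idx)))
))

test_users <- users[test_idx, ]
train_users <- users[-test_idx, ]

print(table(test_users$party_id))
```

Splitting by user rather than by tweet keeps every account entirely in one set, so no user contributes to both fitting and evaluation.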
Littman, Justin, 2017. “115th U.S. Congress Tweet Ids”, Harvard Dataverse, V1, http://dx.doi.org/10.7910/DVN/UIVHQR.
Repository “congress-legislators” in GitHub group “unitedstates”. https://theunitedstates.io/congress-legislators/legislators-current.csv.